tcgaCleaneR: removing unwanted variation the TCGA rectum adenocarcinoma RNA-seq data
Last updated on: 2022-04-29
1 Introduction
In this vignette, we show how to use the tcgaCleaneR package to remove unwanted variation from the TCGA rectum adenocarcinoma RNA-seq data. We refer to ‘TCGA_READ_RNAseq_Vignette.Rmd’ for more detail about.
1.1 Data preparation
All datasets that are required for this vignette can be found here link
2 TCGA READ gene expression data
2.1 RNA-seq data
We load the TCGA_SummarizedExperiment_HTseq_READ.rds file. This is a SummarizedExperiment object that contains:
assays:
-Raw counts
-FPKM
-FPKM.UQ
colData:
-Batch information
-Clinical information (collected from different resources)
rowData:
-Genes’ details (GC, chromosome, …)
-Several lists of housekeeping genes
The lists of housekeeping genes might be suitable to use as negative control genes (NCG) for RUV-III normalization.
The Libraries_HelperFunctions_ForTcgaReadVignette.R containes all helper functions that are required for this vignette.
source('Libraries_HelperFunctions_ForTcgaReadVignette.R')
read.se <- readRDS(
'../TCGA_SummarizedExperiment_HTseq_READ.rds'
)
# source('Scripts/Libraries_HelperFunctions_ForTcgaReadVignette.R')
# read.se <- readRDS(
# 'TCGA_SummarizedExperiment_HTseq_READ.rds'
# )
n.cores = 52.2 Microarray gene expression data
We also load the TCGA READ microarray gene expression data. This data will be used as an orthogonal platform to assess the performance of different RNA-seq normalizations. The data was downloaded from the TCGA firehouse repositories. This data contains 17814 gene and 72 samples.
read.micro.array <- base::readLines(
'READ.transcriptome__agilentg4502a_07_3__unc_edu__Level_3__unc_lowess_normalization_gene_level__data.data.txt')
read.micro.array <- read.delim(
file = base::textConnection(read.micro.array[-2]),
row.names = 1,
stringsAsFactors = FALSE
)
read.micro.array <- read.micro.array[
complete.cases(read.micro.array) ,
]
colnames(read.micro.array) <- gsub(
'\\.',
'-',
colnames(read.micro.array)
)
row.lables <- row.names(read.micro.array)
read.micro.array <- apply(
read.micro.array,
2,
as.numeric
)
row.names(read.micro.array) <- row.lables
### Sample annotation
index.cancer <- read.se$tissue == 'cancer'
read.micro.array.sampleAnnot <- as.data.frame(
SummarizedExperiment::colData(read.se[ , index.cancer])
)
common.samples <- intersect(
read.micro.array.sampleAnnot$Sample,
colnames(read.micro.array)
) # 67
read.micro.array.sampleAnnot <- read.micro.array.sampleAnnot[
common.samples , ]
read.micro.array <- read.micro.array[ , common.samples] # 17812 genes * 67 samples2.3 Removing genes and plates or samples
Here, we explain what kind of genes and plates or samples are removed before any down-stream analysis.
2.3.1 Lowly expressed genes
We identify lowly expressed genes in cancer and normal samples separately. Genes with at least 15 raw counts in at least 10% of cancer and normal samples are retained for down-stream analyses. These details cab be found in the “keep.cancer” and ‘keep.normal’ columns of the gene annotation file in the SummarizedExperiment object.
normal.tissues <- SummarizedExperiment::colData(read.se)$tissue == 'normal'
cancer.tissues <- SummarizedExperiment::colData(read.se)$tissue == 'cancer'
keep.genes.normal <- apply(
SummarizedExperiment::assay(read.se[ , normal.tissues], 'HTseq_counts'),
1,
function(x) length(x[x > 15]) >= round(.1*sum(normal.tissues), digits = 0)
)
keep.genes.normal <- names(keep.genes.normal[keep.genes.normal == TRUE])
keep.genes.cancer <- apply(
SummarizedExperiment::assay(read.se[ , cancer.tissues], 'HTseq_counts'),
1,
function(x) length(x[x > 15]) >= round(.1*sum(cancer.tissues), digits = 0)
)
keep.genes.cancer <- names(
keep.genes.cancer[keep.genes.cancer == TRUE])2.3.2 Keep protein conding genes
We also keep only protein coding genes. This detail can be found in the “gene_type” column of the gene annotation file. This is a arbitrary filtering, any gene_type of interest can be retained in the data.
proteinCoding.index <- as.data.frame(
SummarizedExperiment::rowData(read.se)
)$gene_type. == 'protein.coding'
keep.proteinCoding <- as.data.frame(
SummarizedExperiment::rowData(read.se)
)$gene_id.v[proteinCoding.index]
selected.genes <- intersect(
unique(c(keep.genes.normal, keep.genes.cancer)),
keep.proteinCoding)
SummarizedExperiment::rowData(read.se)$selected.genes <- 'remove'
SummarizedExperiment::rowData(read.se)$selected.genes[
SummarizedExperiment::rowData(read.se)$gene_id.v
%in% selected.genes] <- 'keep'2.3.3 Remove genes that have no or duplicated ENTREZ or gene symbol ids
Further, we remove genes without or duplicated ENTREZ gene id or gene symbol.
### entrez ids
SummarizedExperiment::rowData(read.se)$entrezgene.use <- 'keep'
na.duplicated <-
is.na(SummarizedExperiment::rowData(read.se)$entrezgene_id_BioMart) |
duplicated(SummarizedExperiment::rowData(read.se)$entrezgene_id_BioMart)
SummarizedExperiment::rowData(read.se)$entrezgene.use[na.duplicated] <-
'remove'
### gene symbol
SummarizedExperiment::rowData(read.se)$geneName.use <- 'keep'
na.duplicated <-
is.na(SummarizedExperiment::rowData(read.se)$gene_name.) |
duplicated(SummarizedExperiment::rowData(read.se)$gene_name.)
SummarizedExperiment::rowData(read.se)$geneName.use[na.duplicated] <-
'remove'
keep.genes <-
SummarizedExperiment::rowData(read.se)$entrezgene.use == 'keep' &
SummarizedExperiment::rowData(read.se)$geneName.use == 'keep'
read.se <- read.se[keep.genes,]2.3.4 Keep genes that are required for the CMS identification
Colorectal cancers are classified into four transcriptomic-based subtypes, consensus molecular subtypes (CMS), with distinct features. We also keep all genes that are used by the CMScaller R package to identify the CMS in the data.
cms.geneSets <- CMScaller::geneSets.CMS
cms.genes <- unname(unlist(cms.geneSets))
SummarizedExperiment::rowData(read.se)$cms.genes <- 'no'
index <-
SummarizedExperiment::rowData(read.se)$entrezgene_id_BioMart %in%
unique(cms.genes)
SummarizedExperiment::rowData(read.se)$cms.genes[index] <- 'yes'
keep.genes <-
SummarizedExperiment::rowData(read.se)$selected.genes == 'keep' |
SummarizedExperiment::rowData(read.se)$cms.genes == 'yes'
read.se <- read.se[keep.genes,]2.3.5 Keep plates with at least 2 samples
The READ RNA-seq study involved 177 assays generated using 14 plates over four years (R.Molania, bioRxiv, 2021). We keep plates with at least three samples for down-stream analyses. The plate “A32Y” has only one sample, so this will be excluded from the analysis.
keep.plates <- names(which(table(read.se$plate_RNAseq) > 2))
keep.plates <- read.se$plate_RNAseq %in% keep.plates
read.se <- read.se[
SummarizedExperiment::rowData(read.se)$gene_id.v ,
keep.plates] # 16327 176After the filterations above, the READ SummarizedExperiment object contains 16327 genes and 176 samples.
2.4 Library size (sequencing-depth)
After removing genes and samples, we compute library size (total counts) and add this to the SummarizedExperiment object. We will use log2 of library size for all down-stream analyses.
read.se$ls <-log2(colSums(
SummarizedExperiment::assay(read.se, 'HTseq_counts')))Figure 2.1 shows the library size of the TCGA READ RNA-seq data across years. Substantial library size differences between samples profiled in 2010 and the rest of the samples are clearly visible.
read.se$Year <- read.se$year_mda
tcgaCleaneR::plotLibSize(data = read.se, plot_type = 'Scatterplot')Figure 2.1: Library size of the TCGA READ RNA-seq data coloured by different years.
2.5 Major time interval
Based on library size variation, we divide samples into two major time intervals, 2010 and 2011:2014, for down-stream analyses. We explored a range of clinical details of samples from the two major time intervals and did not find any specific markers that are associated with the time intervals.
read.se$time.points <- '2010'
index <- read.se$year_mda != 2010
read.se$time.points[index] <- '2011:2014'
read.se$time.points <- factor(
read.se$time.points ,
levels = c(
'2010',
'2011:2014'))2.6 Major gene expression-based biological populations
We identify major gene expression-based biological populations in order to create pseudo-samples (R.Molania, bioRxiv, 2021). In the TCGA READ RNA-seq data, we use the microsatellite instability (MSI) and consensus molecular subtypes (CMS) to create different sets of PRPS.
Note that, any biological populations should roughly show distinct gene expression profile, otherwise they are not helpful for PRPS. For example, we did not find distinct gene expression patterns associated with tumor stage in the TCGA READ RNA-seq data, so we do not use tumor stage for PRPS.
2.6.1 Microsatellite instability (MSI)
The details of the MSI status can be found in the sample annotation file. There are:
msi-h: microsatellite instability high
msi-l: microsatellite instability low
mss: microsatellite stable
indeterminate
Figure 2.2 shows the numbers of individual MSI status in the TCGA READ RNA-Seq data.
colnames(SummarizedExperiment::colData(read.se))[932] <-
'msi.status'
msi.groups <- sort(unique(
SummarizedExperiment::colData(
read.se)$msi.status))
msi.new.names <- c(
'Indeterminate',
'MSI-H',
'MSI-L',
'MSS')
for(i in 1:4){
index <- SummarizedExperiment::colData(
read.se)$msi.status == msi.groups[i]
SummarizedExperiment::colData(
read.se)$msi.status[index] <- msi.new.names[i]
}
index <- read.se$tissue == 'normal'
read.se$msi.status[index] <- 'Adjacent normal'
read.se$msi.status <- factor(
x = read.se$msi.status,
levels = c(
'MSS',
'MSI-L',
'MSI-H',
'Indeterminate',
'Adjacent normal'
))
ggplot(data = as.data.frame(SummarizedExperiment::colData(read.se)),
aes(x = msi.status)) +
geom_bar() +
xlab('MSI') +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = 1),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
axis.text.x = element_text(size = 14, angle = 20, vjust = 1, hjust = 1),
axis.text.y = element_text(size = 14))Figure 2.2: MSI status in the TCGA READ-RNA-Seq data.
2.6.2 Consensus molecular subtypes (CMS)
As we mentioned above, colorectal cancers can be classified into four widely accepted consensus molecular subtypes (CMS) based on their gene expression profiles. This classification provides a framework for stratifying the treatment of patients with colon and rectum cancer. There are CMS1, CMS2, CMS3, CMS4 subtypes.
Here, we use the CMScaller R package to identify the CMS of the TCGA READ RNA-seq samples. The CMScaller provides a classification based on pre-defined cancer-cell intrinsic CMS templates. Figure 2.3 shows heatmaps of the relative expression levels of the CMS marker genes (vertical bar) with classifications indicated below (horizontal bar, white indicating prediction confidence p-values).
tcga.harmonized <- names(SummarizedExperiment::assays(read.se))
index.cancer <- read.se$tissue == 'cancer'
set.seed(2010301149)
par(mfrow = c(1,3))
cms.clusters.cancer.tcga <- lapply(
tcga.harmonized,
function(x){
expr.matrix <- as.matrix(SummarizedExperiment::assay(read.se[ , index.cancer], x))
row.names(expr.matrix) <- SummarizedExperiment::rowData(read.se)$entrezgene_id_BioMart
cms.cluster <- CMScaller::CMScaller(
emat = log2(expr.matrix + .5),
RNAseq = FALSE,
FDR = 0.05,
verbose = FALSE)
return(cms.cluster)
})Figure 2.3: Consensus molecular subtypes (CMS) identification using the CMScaller R package in the TCGA raw counst, FPKM and FPKM.UQ data (from left to right). Heatmaps show the relative expression levels of CMS marker genes (vertical bar) with classifications indicated below (horizontal bar, white indicating prediction confidence p-values).
We add the obtained CMS subtypes to the sample annotation object.
names(cms.clusters.cancer.tcga) <- tcga.harmonized
col.names <- paste0(
'cms.cancer.',
c('rawCounts',
'fpkm',
'fpkmUq'))
for(i in 1:3){
SummarizedExperiment::colData(read.se)[ , col.names[i]] <-
'Adjacent normal'
index <- match(
row.names(cms.clusters.cancer.tcga[[i]]),
read.se$Sample
)
SummarizedExperiment::colData(read.se)[ , col.names[i]][index] <-
as.character(cms.clusters.cancer.tcga[[i]]$prediction)
index <- is.na(SummarizedExperiment::colData(
read.se)[ , col.names[i]]
)
SummarizedExperiment::colData(read.se)[ , col.names[i]][index] <-
'Not classified'
SummarizedExperiment::colData(read.se)[ , col.names[i]] <- factor(
x = SummarizedExperiment::colData(read.se)[ , col.names[i]],
levels = c(
'CMS1',
'CMS2',
'CMS3',
'CMS4',
'Adjacent normal',
'Not classified'))
}We also apply the classifier to samples within the major time intervals (2010 and 2011:2014) using the raw counts, FPKM and FPK.UQ normalized datasets (Figure 2.4). The reason for applying the classifier within each key time interval was to assess the effect of large library differences on the CMS classifications.
set.seed(2010301149)
par(mfrow = c(2,3))
read.cancer.se <- read.se[ , index.cancer]
cms.clusters.time.points.cancer.tcga <- lapply(
levels(read.se$time.points),
function(x){
index.time.points <- read.cancer.se$time.points == x
cms.clusters.cancer.tcga <- lapply(
tcga.harmonized,
function(y){
expr.matrix <- as.matrix(SummarizedExperiment::assay(
read.cancer.se[, index.time.points], y))
row.names(expr.matrix) <- SummarizedExperiment::rowData(
read.cancer.se)$entrezgene_id_BioMart
cms.cluster.time <- CMScaller::CMScaller(
emat = log2(expr.matrix + .5),
RNAseq = FALSE,
FDR = 0.05,
verbose = FALSE)
return(cms.cluster.time)
})
names(cms.clusters.cancer.tcga) <- tcga.harmonized
return(cms.clusters.cancer.tcga)
})Figure 2.4: Identification of consensus molecular subtypes (CMS) in the TCGA READ RNA-seq studies. Heatmaps show the relative expression levels of CMS marker genes (vertical bar) with classifications indicated below (horizontal bar, white indicating prediction confidence p-values). The CMS classification were performed within each key time intervals (first row is2010 and the second row is 2010-2014) to assess the impact of batch effects on the classification.
names(cms.clusters.time.points.cancer.tcga) <- paste0(
'CMS',
'_',
levels(read.se$time.points))Here we add the CMS details to the sample annotation object.
raw.counts <- as.data.frame(rbind(
cms.clusters.time.points.cancer.tcga$CMS_2010[[1]],
cms.clusters.time.points.cancer.tcga$`CMS_2011:2014`[[1]])
)
fpkm <- as.data.frame(rbind(
cms.clusters.time.points.cancer.tcga$CMS_2010[[2]],
cms.clusters.time.points.cancer.tcga$`CMS_2011:2014`[[2]])
)
fpkm.uq <- as.data.frame(rbind(
cms.clusters.time.points.cancer.tcga$CMS_2010[[3]],
cms.clusters.time.points.cancer.tcga$`CMS_2011:2014`[[3]])
)
cms.clusters.time.points.cancer.tcga <- list(
raw.counts = raw.counts,
fpkm = fpkm,
fpkm.uq = fpkm.uq)
col.names <- paste0(
'cms.cancer.time.points.',
c('rawCounts',
'fpkm',
'fpkmUq')
)
for(i in 1:3){
SummarizedExperiment::colData(read.se)[ , col.names[i]] <- 'Adjacent normal'
index <- match(
row.names(cms.clusters.time.points.cancer.tcga[[i]]),
read.se$Sample
)
SummarizedExperiment::colData(read.se)[ , col.names[i]][index] <-
as.character(cms.clusters.time.points.cancer.tcga[[i]]$prediction)
index <- is.na(SummarizedExperiment::colData(read.se)[ , col.names[i]])
SummarizedExperiment::colData(read.se)[ , col.names[i]][index] <- 'Not classified'
SummarizedExperiment::colData(read.se)[ , col.names[i]] <- factor(
x = SummarizedExperiment::colData(read.se)[ , col.names[i]],
levels = c('CMS1',
'CMS2',
'CMS3',
'CMS4',
'Adjacent normal',
'Not classified'))
}3 Study outline
Figure 3.1 shows the outline of the TCGA READ RNA-seq study. This study involved 176 assays generated using 14 plates over four years.
read.se$TSS <- read.se$TSS_mda
read.se$Plates <- read.se$plate_RNAseq
read.se$Tissues <- read.se$tissue
read.se$Center <- read.se$center_mda
read.se$Purity_singscore <- read.se$purity_HTseq_FPKM.UQ
read.se$Tumor.stage <- read.se$tumor_stage__RNAseq
tcgaCleaneR::plotStudyOutline(data = read.se)Figure 3.1: Outline of the TCGA rectum adenocarcinoma RNA-seq study. 176 rectum adenocarcinoma and adjacent normal tissues were collected from 13 tissue source sites (TSS) and distributed across 14 sequencing plates for profiling at 14 time points over a span of 4 years. The consensus molecular subtypes were obtained using the R package CMScaller on the FPKM.UQ normalized data. The MSI status were obtained from the pathological reports in the TCGA clinical data. The library sizes are calculated after removing lowly expressed genes and log2 transformed. The tumour purity scores are obtained from 1- stromal&immune scores.
read.cancer.se <- read.se[ , index.cancer]4 RUV-III normalization
Here, we will explain how to select a suitable set of negative control genes and pseud-replicate of pseudo-samples to remove library size and plate effects from the TCGA READ RNA-seq data.
4.1 Selection of negative control genes (NCG)
First, we need to highlight that in our usage, a negative control gene is one that is not expected to change much across biological factors of interests (R.Molania, NAR, 2019). Second, in the presence of unwanted variation usually a subset of genes are affected in different ways, so we need to emphasize that our approach to negative controls is pragmatic: if using a given gene in RUV-III as one of the set of negative control genes helps, as indicated by various measures, then whether or not it is an ideal negative control gene is moot: it helped. This does not rule out the fact that we may be able to do better by replacing that gene with a different gene designated a negative control. Third, we point out that theoretical analyses not presented here show that it is not the extent to which individual genes designated as negative controls are ideal or less than ideal negative controls that drives the success or otherwise of RUV-III; that is a property of the full set of negative controls. We will frequently get very good results using the entire set of genes being studied as negative controls, even when many genes are changing. (That this is not unreasonable follow from the theory just mentioned.) Fourth, it is usually the case that using more genes as negative control genes is better than fewer. That is a matter of stability, but as stated in the introduction, there is a bias-variance trade-off here: using too many genes as negative controls may be counter-productive. You must look and see. Fifth, endogenous genes generally make more suitable negative controls than spike-ins. The reason here is the obvious one, namely, that endogenous genes have shared the complete sample experience of the other genes, whereas spike-ins can only reflect unwanted variation in the process from the point at which they were added onward. The best source of endogenous negative control genes are ones that were found to be stable in previous studies similar to the one being analyzed.
Here, we explain an approach the we used in our paper R.Molania, bioRxiv, 2021 to select a suitable set of negative control genes for the TCGA READ RNA-seq data.
First, we select samples that they have the same CMS subtypes obtained by applying the CMS classifier within and between the major time interval using the TCGA FPKM.UQ data. We found 118 samples that meet this criteria.
SummarizedExperiment::colData(read.cancer.se)$cms.use <- 'no'
a <- as.character(
SummarizedExperiment::colData(read.cancer.se)$cms.cancer.fpkmUq
)
b <- as.character(
SummarizedExperiment::colData(read.cancer.se)$cms.cancer.time.points.fpkmUq
)
SummarizedExperiment::colData(read.cancer.se)$cms.use[a==b] <- 'yes'
### Data and gene annot
index.sample.use <- SummarizedExperiment::colData(
read.cancer.se)$cms.cancer.fpkmUq!='Not classified' &
SummarizedExperiment::colData(read.cancer.se)$cms.use == 'yes'Then, we apply ANOVA on individual genes expression with the CMS being a factor using the TCGA FPKM.UQ data.
ftest.cms <- .Ftest(
data = as.matrix(SummarizedExperiment::assay(
read.cancer.se[ , index.sample.use],
'HTseq_FPKM.UQ')
),
variable = droplevels(
SummarizedExperiment::colData(
read.cancer.se
)$cms.cancer.fpkmUq[index.sample.use]),
is.log = FALSE,
n.cores = n.cores)We rank genes based on their F-statistics and select genes that have the rank below 1000. These genes most likely do not capture the CMS variation. In addition, we remove genes that are on Y chromosome. Finally, we end up with 997 genes as a potential set of negative control genes for RUV-III normalization.
SummarizedExperiment::rowData(read.cancer.se)$cmsDE.genes <- rank(
ftest.cms$FValue)
negative.control.genes <- SummarizedExperiment::rowData(
read.cancer.se
)$cmsDE.genes < 1000 &
SummarizedExperiment::rowData(
read.cancer.se
)$chromosome_name_BioMart != 'Y'4.1.1 Assessments of negative control genes
We perform PCA on the raw counts of the TCGA READ RNA-seq data using only the selected negative control genes to assess their performance (Figure 4.1). Ideally, they should capture the library size differences, but not much of variation related to the CMS. Note that, the RUV-III method is robust to negative control genes, but not always. Figure 4.1 shows that the selected negative control genes capture the library size differences and they do not capture variation related to the CMS.
pca.ncg <- .pca(
data = as.matrix(
SummarizedExperiment::assay(
read.cancer.se[negative.control.genes , ],
'HTseq_counts')
),
is.log = FALSE
)
p1 <- .scatter.density.pc(
pcs = pca.ncg$sing.val$u[, 1:3],
pc.var = pca.ncg$var,
group.name = 'Time (years)',
group = SummarizedExperiment::colData(read.cancer.se)$time.points,
color = major.times.colors,
strokeSize = .2,
pointSize = 3,
strokeColor = 'gray30',
alpha = .5
)
p2 <- .scatter.density.pc(
pcs = pca.ncg$sing.val$u[, 1:3],
pc.var = pca.ncg$var,
group.name = 'CMS',
group = SummarizedExperiment::colData(read.cancer.se)$cms.cancer.fpkmUq,
color = cms.colors,
strokeSize = .2,
pointSize = 3,
strokeColor = 'gray30',
alpha = .5
)
do.call(
gridExtra::grid.arrange,
c(p1, p2, ncol = 4))Figure 4.1: PCA plots of the raw counts of the TCGA READ RNA-seq data using only the negative control genes (997 genes).
4.2 Pseudo-replicates of pseudo-samples (PRPS)
As we have mentioned above, the RUV-III method also requires technical replicates to estimate one aspects of unwanted variation. Here we propose a new approach, pseudo-replicates of pseudo-samples (PRPS), for deploying RUV-III method to remove unwanted variation from the TCGA READ RNA-seq data. This approach requires at least one roughly known biologically homogeneous subclass of samples shared across the sources of unwanted variation. To create PRPS, we first need to identify the sources of unwanted variation, which we will call batches in the data. Then the gene expression measurements of biologically homogeneous sets of samples are averaged within batches, and the results called pseudo-samples. Since the variation between pseudo-samples in different batches is mainly unwanted variation, by defining them as pseudo-replicates and used in RUV-III as replicates, we can easily and effectively remove the unwanted variation.
4.2.1 Selection of biological populations to create PRPS
To select homogeneous biological populations, we consider two major biological factors:
1. The CMS subtypes that were obtained by applying the CMS classifier across and within the major time intervals using the TCGA FPKM.UQ data.
2. The MSI status.
So, the 11 combinations (we do not have CMS4_MSI-H) of the 4 CMS, and the 3 MSI statuses were considered to be homogeneous biological populations for the purpose of creating PRPS.
In general, in the TCGA RNA-seq data, plates are completely confounded with times, making it difficult to distinguish plate effects from time effects. So, We consider plates as batches to create pseudo-samples. Note that, we need to make sure that our pseudo-samples span across the major time intervals, otherwise we will not be able to remove the library size differences in the data.
samples.to.use <-
read.cancer.se$cms.cancer.fpkmUq != 'Not classified' &
read.cancer.se$msi.status != 'Indeterminate' &
read.cancer.se$cms.use == 'yes'
sample.info <- droplevels(
as.data.frame(SummarizedExperiment::colData(read.cancer.se[ , samples.to.use]))
)
raw.counts <- as.data.frame(
SummarizedExperiment::assay(read.cancer.se[ , samples.to.use] , 'HTseq_counts')
)
read.prps <- tcgaCleaneR::createPRPS(
expr.data = as.data.frame(
SummarizedExperiment::assay(read.cancer.se[, samples.to.use] , 'HTseq_counts')
),
sample.info = droplevels(as.data.frame(
SummarizedExperiment::colData(read.cancer.se[, samples.to.use])
)),
batch = 'PlateId_mda',
biology = c('cms.cancer.fpkmUq', 'msi.status'),
purity = FALSE,
include.ls = FALSE,
include.purity = FALSE,
minSamplesPerBatchPS = 2,
minSamplesForPuirtyPS = 3,
minSamplesForPurityPerBiology = 6,
minSamplesForLibrarySizePS = 3,
minSamplesForLibrarySizePerBatch = 12)